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Abstract 

Many appplications in computational science are sufficiently compute- 
intensive that they depend on the power of parallel computing for viability. 
For all but the "embarrassingly parallel" problems, the performance de- 
pends upon the level of granularity that can be achieved on the computer 
platform. 

Our computational particle physics applications require machines that 
can support a wide range of granularities, but in general, compute-intensive 
state-of-the-art projects will require finely grained distributions. Of the 
different types of machines available for the task, we consider cluster com- 
puters. 

The use of clusters of commodity computers in high performance com- 
puting has many advantages including the raw price/performance ratio 
and the flexibility of machine configuration and upgrade. Here we focus 
on what is usually considered the weak point of cluster technology; the 
scaling behaviour when faced with a numerically intensive parallel compu- 
tation. To this end we examine the scaling of our own applications from 
numerical quantum field theory on a cluster and infer conclusions about 
the more general case. 
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1 Introduction 



A compute- intensive application might have to be distributed over many compute 
nodes in order for it to run at a viable speed. One needs to know how well a 
platform can support the required level of granularity. In this paper we examine 
this in the case of cluster computers. 

The use of clusters of computers in high performance computing has achieved 
some success with the price/performance advantage of using low-cost and yet 
high-speed commercial off-the-shelf (COTS) processors to build a fast machine 
with the budget available. The significant presence of clusters in the TOP500 
list P attests to the power of this approach. 

Cluster computers are also becoming more widely accepted with the growing 
ease with which they can be configured and maintained; there are now numerous 
open source and commercial "Cluster in a box" packages available. It can also 
be important that portable application code can be developed and run on these 
machines, using standard operating systems, libraries, tools and parallel APIs. 

Essential for the use of clusters in many areas of high performance computing is 
the presence of high performance system area network technology. The question 
is whether this is good enough to make parallel computation on clusters viable in 
the face of very fast communications built into MPP or SMP /NUMA machines. 

In this paper we intend to make some quantitative statements on the subject 
by examining the scaling of some codes from our parallel application, viz. Lat- 
tice QCD. In section El we give a brief description of Lattice QCD, emphasising 
the computational task involved. Then in section El we describe how our code 
implements this with particular reference to the parallelisation. We proceed to 
describe in section IH the scaling tests we apply and present their results. These 
are summarised and discussed in section after which we present our conclu- 
sions. 



2 Lattice QCD 

Lattice QCD is a numerical evaluation of Quantum Chromodynamics, the theory 
of the strong interaction, which takes its name from the use of a four- dimensional 
hypercuboid lattice to represent a discretised spacetime (see e.g. [21E1 for a recent 
overview of the subject). In the lattice theory, each lattice site (vertex of the 
lattice) represents a spacetime point. 

These computations are demanding enough that they routinely make much use 
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of supercomputers, and the grid-based formulation of the problem makes it most 
suitable for data decomposition on parallel machines. 

One of the central defining features of a Lattice QCD calculation is the discretised 
representation of the Dirac fermion operator which describes the interaction of 
quarks in the theory. On the lattice, the discretised operator takes the form of 
a fermion matrix. The algorithms used for Lattice QCD calculations are such 
that the bulk of the computer time is spent in the inversion of this matrix. 
There is a variety of representations of the fermion matrix which have different 
properties on the lattice but they should all become identical to the continuum 
Dirac operator in the limit of zero lattice spacing. The particular formalism used 
in this work is that originally due to Wilson j3]. The Wilson fermion matrix is 

Mxy = Sxy nDxy (1) 

where the indices x and y refer to lattice sites and G M is the hopping parameter, 
related to the bare quark mass. 

The Wilson hopping matrix is 

3 

Dxy = 11(1 - J^,)U^Jyx+^ + (1 + l^)Ulx-^(^yx-f. (2) 
^l=o 

where the lattice gauge Gelds U^x ^ SU{?>) C C^^^ carry colour indices and the 
Dirac 7-matrices 7^ G C''^^: {7^, 71,} = 25^^, carry spin indices. 

Therefore M is a matrix in lattice site ® spin colour space. If the number 
of sites in the lattice is V then the fermion matrix is a complex matrix of size 
{12VY acting on vectors consisting of 12V complex numbers. 

The lattice is of finite extent, and we impose periodic or antiperiodic boundary 
conditions. 

Note that the lattice site structure of this matrix is either local or connects 
nearest-neighbour sites. Therefore it is a sparse matrix and is implemented as 
an operator rather than stored explicitly 

Other formulations of the fermion matrix include the Sheikholeslami-Wohlert 
discretisation which adds to the Wilson matrix a3®4x3®4 complex ma- 
trix diagonal in lattice sites , variations of the Kogut-Susskind matrix jE] 
which have no spin components but can have a more than nearest-neighbour 
connectivity [3 |H] and the Fixed Point formulations which have spin and colour 
indices and more than nearest-neighbour connectivity l9j. The choice depends 
on the desired properties, and has consequences for the balance of computation 
to communication required; for example, the Sheikoleslami-Wohlert matrix has 
more local computation than the Wilson matrix with no extra communication 
and will therefore scale better. 
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The Wilson matrix has the favourable property of only requiring nearest-neigh- 
bour communications but since its computational requirements are compara- 
tively modest it does put some pressure on the scaling abilities of a parallel 
machine. Although it is one of the older formulations it is still interesting not 
least because it can be used as the basis for the new breed of Ginsparg-Wilson 
lattice fermion formulations ^HIE]; which have very favourable field-theoretic 
properties but are still very expensive to compute with; a single multiplication 
by the GW fermion matrix involves many (0(100)) multiplications by the Wil- 
son matrix. For these reasons we believe that our test results with the Wilson 
matrix are relevant. However we also consider the effects of a fermion matrix 
requiring more than nearest-neighbour communications. 



3 Description of Lattice QCD software 

We give here some details of the implementation of our lattice QCD applications, 
describing the computational task itself, how the code is written and how the 
code is parallehsed. 

Much of the code used in these tests was actually developed for a specific Lattice 
QCD project called GRAL ^ The ORAL project has as its aim the finite 
size scaling analysis of QCD physics close to the chiral regime of light dynamical 
quarks using the Wilson formulation. This requires a large computational effort, 
and although a cluster is just one of the machines that is used for the project, 
good performance of the Lattice QCD code on it is important to the project's 
success. 

The bulk of the code is written in C compiled with the Compaq C Compiler 
ccc with a set of optimisations enabled with the -fast flag. We use assembly 
language kernels f3] for certain low-level numerically intensive routines where 
performance is critical. Routines from the optimised BLAS library are also used 
where appropriate. Both 32-bit and 64-bit code is implemented, but since 64-bit 
accuracy is preferred for the GRAL project, we present here only 64-bit results. 

The communications are entirely handled with message passing via MPI. The 
fast MPI-library on ALICE is provided by the ParaStation"""^ cluster middleware 

m 

1 GRAL stands for "Going Realistic And Light" . 
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3.1 Fermion matrix inversion 

As stated in section |21 most computer time will be spent in those portions of this 
code which invert the fermion matrix. This involves solving the linear system 

M'ilj = b (3) 

where M is the fermion matrix defined in section |21 6 is the source vector and 
ip is the solution to be found. A Krylov subspace algorithm is best suited to 
the solution of such a sparse linear system. We have implemented several such 
algorithms but here we show results using the BiCGStab [T^ [T3] method which 
is usually the fastest. The communication requirements (described in section 
I3.3|l of these different algorithms are comparable. 

An additional refinement is the use of a parallel (locally lexicographic) SSOR 
preconditioning [ini Ej to reduce the number of iterations of the BiCGStab 
solver. This requires a slightly different pattern of communication and so we 
present these results separately. 

3.2 Grid decomposition 

The four-dimensional lattice is decomposed over a grid of processing nodes. This 
grid can be from zero dimensions {i.e., a single node) to four dimensions (since 
the lattice is four-dimensional). The number of nodes along any dimension is 
restricted in that the local lattice on each node must have identical size and 
shape. This constraint is not strictly necessary on a MIMD machine but it 
makes the implementation simpler and ensures perfect load balancing. 

3.3 Nearest-neighbour communication requirements 

Every iteration of the solver algorithm requires three global summations of a 
complex number and performs two multiplications by the fermion matrix. 

Each fermion matrix multiplication requires nearest-neighbour communication 
of the multiplicand vector. On a lattice boundary in a non-parallelised direction 
this is taken care of by the boundary condition. In a parallelised direction the 
required data is on another node. With the message passing communication 
paradigm, this means that the data on those lattice sites on the boundary of 
a local lattice is sent to the neighbouring node. Each node receives data from 
neighbouring nodes and stores it in buffers. This communication pattern is 
illustrated in figure [H 
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The amount of data to be sent depends on the size of the lattice and the grid 
decomposition. If these are such that on each node of a grid of dimensionahty 
d > the local lattice has V sites, then the number of sites on the boundaries 
from which data must be sent is 

d-l 1 

2vj:-, (4) 

where the sum is over parallelised directions fi and A*"^ is the length of the local 
lattice in direction /i. Recall that a vector consists of 12 complex numbers on 
each lattice site, which is 192 bytes at 64-bit precision. 

For the normal BiCGStab algorithm all the data on a boundary can be sent 
at once as a single message. We send each boundary before the multiplication 
by the fermion matrix begins. The order in which they are sent is essentially 
unimportant. 

The SSOR preconditioning requires most of the data to be sent incrementally as 
a succession of smaller messages at various points within the matrix- vector multi- 
plication; the order is important. This arrangement is potentially less favourable 
because of a possibly inefficient use of available bandwidth and the effect of 
multiple latency penalties. 



3.3.1 Data layout 

The parallelisation efficiency is affected by the way the data is stored along with 
the design of the communication buffers. 

The vectors involved in the calculation are stored as site-major arrays. Each 
array element contains the 12 complex numbers and the n— th element cor- 
responds to the local lattice site with cartesian coordinates (xq, xi, 0:2, X3) via 
n = X3 + N3X2 + N3N2X1 + N3N2N1X0 where A^^ is the size of the local lattice in 
the //th direction. The gauge fields are stored in a similar way. 

This storage order is illustrated in figure [T] Note that because of the layout of 
the data in lattice, the data on the direction 1 boundaries is strided, which takes 
more time to copy than contiguous data. 

The communication buffers are, for convenience, attached to this array, which 
increases its length by an amount the size of the local lattice boundaries. Two 
arrangements of the buffers were tried: 

The "halo" layout 

The "halo" layout places the buffers on either side of the array body, e.g. for a 
local lattice of size Nq x Ni x N2 x N^, the first two buffers would be 
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Node A 




Node B Node C 



Figure 1: An illustration of the nearest-neighbour communications in two di- 
mensions of a two-dimensional local lattice of size NqX Ni. Nodes A and B show 
the communication in a 1-dimensional processor grid in the negative direction 
0: The data on Node A with coordinate = is copied to a buffer on Node B 
where it can be accessed as Node B data with coordinate Xq = Nq. The positive 
direction communication is analogous. Nodes C and B show the communica- 
tion in a 2-dimensional grid in the positive direction 1: The data on Node B with 
coordinate = A^i — 1 is copied to a buffer on Node C where it can be accesssd 
as local data with coordinate xi = —1. The negative direction 1 communication 
is analogous. Each node sends data in this way from all its boundaries. 
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Xi = —1 


Xo = -1 


n = ■ ■ ■ N0N1N2N3 - 1 


Xo = No 


xi = Ni 



The "tail" layout 

An alternative is the "tail" layout: 



n = ■ ■ ■ N0N1N2NS - 1 


Xo = -1 


Xo = No 


Xi = —1 


xi = Ni 



where the buffers are attached to the end of the array. In section 14.4.11 we test 
the differences between these two layouts. 



3.4 Next-to-nearest-neighbour communication 
requirements 

Although the Wilson fermion matrix requires only nearest-neighbour communi- 
cation of the gauge field U, for algorithmic reasons we also need next-to-nearest- 
neighbour access^. 

Furthermore, as mentioned in section d some fermion matrices connect more 
than nearest-neighbour lattice sites. We investigate the effects of this by con- 
sidering next-to-nearest-neighbour communication. This is already possible on a 
1-dimensional grid with the communication patterns already presented, but on 
a 2-dimsional grid the communication in direction 1 has to be changed as shown 
in figure 121 Additional data from the corners of the lattices has to be sent. 

For a lattice with V sites on each node of a grid of dimensionality d > 1, there 
are 

d-l l-L-l 1 

^I'EE^ (5) 

lattice sites on the corners from which data must be sent, where the sum is over 
parallelised directions // and ly and A"^ and N^, are the lengths of the local lattice 
in those directions. This is in addition to the amount of boundary data described 
in section VA.'Al We wish to test the effect of this additional communication. 



^In a typical GRAL application the gauge field is communicated 0(100) times less often 
than the spin-colour vector, therefore the efficiency of the gauge field communications is not 
so important for our considerations. 
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Node A 




Figure 2: An illustration of next-to- nearest-neighbour communications in two 
dimensions of a two-dimensional local lattice of size Nq x Ni. In addition to the 
data communications shown in figure ^ the corner of the node A lattice (coor- 
dinates Xq = Nq, Xi = Ni — 1) is sent to a buffer on node C where it can be 
accessed as data with coordinates xi = —1, xq = —1. We do this in two stages, 
sending the data via the xq = Nq buffer of node B, so the direction communi- 
cation must complete before the direction 1 communication can start. Blocking 
or semi-blocking semantics are suitable here. An alternative (not examined here) 
would be to send the corner from node A directly to node C, which in principle 
avoids this wait. 
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3.4.1 Data layout 



The next-to-nearest neighbour communication pattern in more than one dimen- 
sion requires extra buffers on each array for the corner data. The possibihties 
for different data layouts are therefore increased. We illustrate three possibilities 
for the layout on a two-dimensional grid (which can be generalised for a greater 
number of dimensions): 

Layout 

The corner buffers are added as halos to the xi = —1 and Xi = Ni buffers; 



Xi = -l,Xo = -1 



Xi = — 1 



xi = -l,Xo = Ni 



Xq 



-1 



n = 0---NoNiN2N3 - 1 



Xq = No 



Xi = Ni,xo = — 1 



xi = Ni 



xi = Ni,xo = Ni 



layout 1 

The corner buffers are added to the Xi = —1 and Xi = Ni buffers as tails; 



n = 0---NoNiN2N3 - 1 



Xo = -1 



Xo = A^o 



Xi 



Xi 



-l,Xo 



Xi 



-l,xo = A^i 



xi = A^i 



xi = A^i,xo 



xi = A^i,Xo = A^i 



layout 2 

The corner buffers are added as a tail at the end of the array; 



n = 0---NoNiN2N3 - 1 


Xo = -1 


Xo = No 


Xi = —1 


xi = A^i 



Xi = —1, Xq = —1 


Xi = -l,Xo = Ni 


xi = A^i,Xo = A^i 


xi = A^i,Xo = -1 



Layouts and 1 have the advantage that the data in the Xo = A^'o and xq = — 1 
buffers with xi = (A'^i — 1) can be sent in the negative (positive) direction 1 
along with the data with < xq < A^o, xi = (A''! — 1) as a single message. 

Layout 2 requires that the corners buffers be sent seperately. However, since it 
is merely the tail layout of section IH.H.ll with extra buffer space added to the 
end of it, a single layout can be used for fields both with and without next-to- 
nearest-neighbour interaction without wasting space. 



10 



4 The scaling tests 



Our goal is to test how well the code introduced in section El scales on a cluster. 
We run tests to see how this is affected by various factors; the data layout, the 
locality of the fermion matrix and the solver algorithm. 

In this section we first describe the essential details of all the scaling tests. We 
then describe the hardware and software on which these tests are run. 



4.1 The test specifications 

In order to test the scaling of our lattice QCD applications, we choose to test 
the fermion matrix inversion described in section I^TD since it is the area in which 
many QCD programs and in particular the GRAL applications will spend most 
run-time and it has the considerable communication requirements discussed in 
section ESI All the tests are run in 64-bit precision. 

We choose to time the inversion on a 12^ lattice. This rather arbitrary size is 
smaller than that used in most realistic lattice QCD calculations. As our machine 
size is 128 processors, this lattice size should give an idea of how the performance 
is affected by the size and surface/volume ratio of the sub-lattices on each node 
when extrapolated to larger computers with 512 or 1024 processors. 

Each solve is performed on a random SU (3) gauge field with a random gaussian 
source vector. The solve is repeated many times until a good estimate of the 
average run-time per solve can be obtained. The run-time on a single node is 
very well reproducible, but on multiple nodes there is some variation, perhaps 
due to effects of different physical grid topologies or the load on the network and 
switches, so the entire run is repeated at least five times to average over such 
variations. 

TablelUshows the grid topologies used in the tests. Results from grids of the same 
dimensionality and number of nodes are averaged. The ratio of the communicable 
surface of the local lattice to the local lattice volume is also shown. 

Our main metric of the parallel performance is the speedup; 



speedup 



speed on n nodes 



(6) 



speed on 1 node 



It can also be useful to know the parallel efEciency {scaled speedup); 



parallel efficiency = scaled speedup 



speedup 



(7) 



n 



11 



nodes 


grid 


local lattice 


surf ace/ volume 


2 


2x1x1x1 


6 x 12 x 12 x 12 


0.33 


3 


3x1x1x1 


4 X 12 X 12 X 12 


0.50 


4 


4x1x1x1 


3 X 12 X 12 X 12 


0.67 


6 


6x1x1x1 


2 X 12 X 12 X 12 


1.00 


12 


12 X 1 X 1 X 1 


1 X 12 X 12 X 12 


2.00 


4 


2x2x1x1 


6 X 6 X 12 X 12 


0.67 


9 


3x3x1x1 


4 X 4 X 12 X 12 


1.00 


12 


4x3x1x1 


3 X 4 X 12 X 12 


1.17 


12 


3x4x1x1 


4 X 3 X 12 X 12 


1.17 


16 


4x4x1x1 


3 X 3 X 12 X 12 


1.33 


36 


6x6x1x1 


2 X 2 X 12 X 12 


2.00 


8 


2x2x2x1 


6 X 6 X 6 X 12 


1.00 


27 


3x3x3x1 


4 X 4 X 4 X 12 


1.50 


36 


3x3x4x1 


4 X 4 X 3 X 12 


1.67 


64 


4x4x4x1 


3 X 3 X 3 X 12 


2.00 


16 


2x2x2x2 


6x6x6x6 


1.33 


36 


2x3x3x2 


6x4x4x6 


1.67 


64 


2x4x4x2 


6x3x3x6 


2.00 



Table 1: The grid sizes, local lattice sizes and local lattice surf ace/ volume ratios 
used in the scaling measurements. The global lattice size is 12*^ throughout our 
tests. 
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4.2 The test platform 



ALICE, the Alpha-Linux Cluster Engine at Wuppertal University is a general 
purpose high-performance computer consisting of 128 Compaq DSIO servers, 
each with a 616MHz Alpha 21264 processor, 64 kB 2-way set-associative level 1 
data and instruction caches and a 2 MB level 2 cache jIH]. The Alpha 21264 is 4- 
way superscalar, sustaining 2 floating-point operations per cycle. The operating 
system is SuSE Linux [T^ . 

ALiCE is clustered using Myrinet with the ParaStationS cluster middleware |20]. 
This arrangement takes advantage of the open architecture of the Myrinet com- 
munication hardware to replace the standard firmware supplied by Myricom 
with that supplied by Par- Tec, together with kernel extensions and communica- 
tions libraries. The ParaStation system endeavors to achieve high performance 
by implementing a reliable communications protocol in firmware and keeping 
communications in user space. The two-way 1.28 Gbit/s Myrinet network is 
configured as a multistage crossbar and the Myrinet cards have a 64 bit/33 MHz 
PCI buses. We use MPICH 1.2.3 as the interface between our application code 
and the ParaStation middleware. On ALiCE, the latency for ParaStation/MPI 
is 16/xs; the bi-directional bandwidth is about 170 MB/s fH] . 

4.3 Single node performance 

In evaluating the following results it will be helpful to know the performance 
of the solvers on a single node. Figures El and 0] show how the speed of the 
BiCGStab solver and the SSOR preconditioned BiCGStab solver vary with the 
number of sites in the lattice. Memory cache and memory bandwidth effects 
mean that performance deteriorates as the lattice volume increases. 

BiCGStab sustains a speed of 237 Mflops on our 12'^ test lattice described in 
sectionHm while SSOR preconditioned BiCGStab sustains a speed of 220 Mfiops. 
The latter solver is preferred because it still finds the solution faster due to the 
smaller number of iterations. 



4.4 The results of the scaling tests 

We now examine what happens to the performance of our solvers as we move 
from a serial to a parallel implementation. There are numerous factors which 
might affect the parallel performance; we make a preliminary investigation into 
the effects of data layout and communication semantics before presenting our 
final results for the speedup and parallel efficiency. 
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Figure 3: The speed of the BiCGStab solver on a single node as a function of 
the lattice volume. The corresponding lattice dimensions are indicated along the 
top. 
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Figure 4: The speed of the SSOR preconditioned BiCGStab solver on a single 
node as a function of the lattice volume. The corresponding lattice dimensions 
are indicated along the top. 
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Figure 5: The speedup of the BiCGStab solver using nearest-neighbour blocking 
communications on one- and two-dimensional grids, comparing the data layouts 
described in section IH.H. II 



4.4.1 Investigation of data layout 

We compare how the two data layouts described in section lH.H.ll affect the scaling 
of the data. It might be expected that the halo layout would have better spatial 
locality properties (and therefore use the cache better), but the performance, 
shown in figure seems to be much the same. This could be because of the data 
prefetching written into our assembler kernels. 

We choose to use the tail format. It is easily adapted to the case of next-to- 
nearest-neighbour communications (which we need for the gauge field) by simply 
adding extra buffer space at the end of the array. 



4.4.2 Investigation of communication semantics 

We can hope to improve performance using non-blocking (asynchronous) commu- 
nications {e.g MPI_Isend) rather than blocking (synchronous) communications 
{e.g MPI_Send). 

Figure El shows that this improves BiCGStab performance on a 1-dimensional 
grid but degrades it on a 2-dimensional grid. We try a semi-blocking approach 
where non-blocking communications are used in direction 0, then a block allows 
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Figure 6: The speedup of the BiCGStab solver comparing nearest-neighbour 
blocking and non-blocking communications. 



this communication to complete before non-blocking communications are used in 
direction 1; this aproach does not work either. We have verified that this is also 
the case for grids of dimension greater than 2. Henceforth with BiCGStab we 
use non-blocking communications on one- dimensional grids and blocking com- 
munications on grids of more than one dimension. 

Figure [3 shows that the situation is different with SSOR preconditioned BiCG- 
Stab: non-blocking communications are faster even on the two-dimensional grid. 
Henceforth we use non-blocking communications in connection with SSOR pre- 
conditioned BiCGStab. We suspect that the communication of BiCGStab, where 
long messages are sent, might negatively interfere with caching and compute pro- 
cesses. A better separation of communication directions with smaller message 
lengths, as is the case for SSOR, might lead less often to a situation in which 
data used for compute operations competes for cache hues with data used simul- 
taneously for communication 



4.5 Results for solver speedup and parallel efficiency 



Having determined that the data layout is not very significant and when it is bet- 
ter to use blocking or non-blocking communications, we can present the results 
for the best scaling of our solvers on one- to four-dimensional grids. 
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Figure 7: The speedup of the SSOR preconditioned BiCGStab solver comparing 
nearest-neighbour blocking and non-blocking communications. 
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Figure 8: The speedup of the BiCGStab solver. 
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Figure 9: The parallel efficiency of the BiCGStab solver. 
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Figure 10: The speedup of the SSOR preconditioned BiCGStab solver. 
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Figure 11: The parallel efficiency of the SSOR preconditioned BiCGStab solver. 

What we see is that for the BiCGStab solver on this test lattice (figures |H1 and IH)) 
the one- and two-dimensional grids become quite inefficient on more than a few 
nodes. On the three- and four- dimensional grids the drop in efficiency on large 
numbers of nodes is not as dramatic; a reasonable speedup is achieved due to the 
more favourable surf ace/ volume ratio. The speedup on a four-dimensional grid 
is not convincingly better than that on a three-dimensional grid, presumably 
because the data on the direction 3 boundaries is so finely strided as to negate 
the advantage of better surf ace/ volume ratios. 

The SSOR results (figures HUl and ITTj) show that the speedup is less affected 
by the dimensionality of the grid i.e. the surface/ volume ratio. However the 
fact that the data on the direction 3 boundaries are so finely strided adversely 
affects the four-dimensional grid performance, since in this case it means that 
very many send/receives have to be done, which is evidently too inefficient due 
to latency effects. 

A main result of these investigation is that a small 12^ lattice still performs with 
a scaled speedup of 0.45 on 64 processors. We will comment on this in section 

o 

To shed further light on these results we plot in figures IT^ and IT!^ the percentage 
of the total wall-clock execution time required for communication against the 
ratio of the communicable surface of the local lattice to the local lattice volume. 
The correlation does not usually depend very strongly on the dimensionality of 
the grid; a notable exception is the SSOR preconditioned solver on the four- 
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Figure 12: The dependence of the percentage of BiCGStab run-time spent in 
communication on the surf ace/ volume ratio of the local lattice. 



dimensional grids. Note also how the dependence is weaker when the solver is 
SSOR preconditioned. 



4.6 Results for next-to-nearest-neighbour communication 

We examine the overhead introduced by the need to send the extra data from the 
boundary corners by implementing next-to-nearest-neighbour communications 
and timing exactly the same BiCGStab solve of the same fermion matrix as before 
on two-dimensional grids. Any difference is therefore entirely due to the extra 
comunication. Again it was found that for BiCGStab, blocking communications 
were fastest, so these tests are performed with blocking communications. 



4.6.1 Investigation of data layout 

We compare in figure [T3I the three layouts defined in section 13.4.11 Because of 
the seperate messages required, the third layout is the least scalable, but the 
differences only become significant in the region of poor scaling. Because of this 
and the fact that for our applications the next-to-nearest-neighbour communi- 
cations are relatively unimportant, we adopt the third layout for the gauge field 
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Figure 13: The dependence of the percentage of SSOR preconditioned BiCGStab 
run-time spent in communication on the surf ace/ volume ratio of the local lattice. 



in the GRAL code, utilising the benefits of a unified data layout as described in 
section 13. 41 



4.6.2 Results for the speedup 

In figureElwe take the most scalable next-to-nearest-neighbour layout and com- 
pare its speedup with the nearest-neighbour communications on a two-dimension- 
al grid, demonstrating the effect of sending the extra data from the corners of 
the lattice. 

This is perhaps a favourable comparison since the amount of extra data that 
must be sent for next-to-nearest-neighbour communications increases with the 
dimensionality d of the node grid as described in section 13.41 



5 Summary and conclusions 

We have presented the results of tests of the scaling of the Wilson matrix solver 
for a 12"^ test lattice on ALiCE . Here we summarize those results and discuss 
their implications. 
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Figure 14: The speedup of the BiCGStab solver using next-to-nearest-neighbour 
blocking communications on two-dimensional grids, comparing the data layouts 
described in section 1^.4.11 
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Figure 15: The speedup of the BiCGStab solver using nearest- and next-to- 
nearest-neighbour blocking communications on two-dimensional grids 
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5.1 Summary 



On small numbers of nodes the BiCGStab solver scales without much dependence 
on the grid decomposition. If a minimum effiency of 75% percent is demanded 
then figure El indicates that no more than 4 nodes should be used. However it 
still maintains around 50% efficiency on about 16-36 nodes on a three- or four- 
dimensional grid, corresponding to local lattices of 6'^-4^ x 6^ or surf ace/ volume 
ratios of 1.33-1.67. Thereafter the efficiency degrades, but not too sharply. 

The SSOR preconditioned solver has similar small grid characteristics; figure ITT] 
suggests that the 75% efficiency bound is reached at around 4-9 nodes. The 
four- dimensional grids rapidly become very slow but an efficiency of around 50% 
is maintained on two and three-dimensional grids of around 36 nodes, i.e. a 
4 X 4 X 3 X 12 local lattice with surf ace/ volume ratio 1.67 or a 2^ x 12^ lattice 
with surf ace/ volume ratio 2.00. The performance degrades slowly thereafter; we 
still have 45% efficiency on 64 nodes with a 3^ x 12 local lattice. 

That the scaling performance need not be seriously affected by the data layout is 
a heartening result for code developers; we are free to organise data in whichever 
way is convenient. 

A curious feature of our system is the performance of the non-blocking commu- 
nications. It appears that if the communications in different directions are well 
separated then non-blocking semantics are an improvement. But they are a hin- 
drance if those communications are done at the same time, even when separated 
by an explicit barrier as in the semi-blocking case. We have argued above that 
cache effects might be responsible for these differences. 

One gratifying result is how well the SSOR preconditioned solver scales. Its 
algorithmic performance is excellent but its more complicated pattern of com- 
munication can in practice lead to scaling problems due to the communication 
latency. The vice of needing many smaller messages rather than a few large ones 
appears to become a virtue, since the separation of sends allows the faster non- 
blocking communications to be used. This demonstrates that a more complicated 
algorithm need not be deleterious to efficiency and performance. 

5.2 Discussion 

We see that in general the scaling ability is limited by the additional time that 
has to be used for the communications and this is essentially determined by the 
surf ace/ volume ratio. Therefore the results presented here can serve as a guide 
for planning future runs on realistic lattices of larger global size, bearing in mind 
that increasing the global size introduces additional overheads from the collective 
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communications. 



For example, the discussion in the previous section concluded that our 12^ lat- 
tice runs with 75% efficiency on 4 nodes with BiCGstab. According to table 
HI this corresponds to a surf ace/ volume ratio of |. Therefore if we require 75% 
efficiency on any lattice we should aim for grid decomposition which gives a local 
surface/ volume ratio of about |. From equation Q we obtain the constraint on 
the dimensions of the local lattice in parallel directions 

d-i ^ ^ 

E^-J (8) 

This is achieved for a 12'^ x 24 lattice with a grid of 1x2x2x2 (8 processors) 
or possibly 1x1x2x6 (12 processors). A 16^ x 32 lattice needs alx2x2x2 
(8 processors) or 1x2x2x4 (16 processors) grid. A 24^ x 48 lattice needs a 
2x2x2x4 (32 processors) grid. 

With SSOR preconditioning we might be able to use 1.0 as a target surface/ volume 
ratio, and use grids of 1x2x2x4 (16), 1x2x4x4 (32) and 1x4x4x4 
(64) respectively. A similar analysis suggests that a lattice of size 24^ can be 
expected to run with a scaled efficiency of about 0.45 on 512 processors. 

While there must be an optimum grid decomposition for a given problem and 
problem size, the complicated nature of the relationship between latency, band- 
width and message length makes finding this "sweet spot" a difficult optimisation 
problem. 

In general there is a trade-off between speed and efficiency which must be de- 
termined according to the needs of each project and the resources available; one 
would also take into account the single-node speed when deciding the local lattice 
size. 

The GRAL project typically requires a number of concurrent simulations at 
different physical parameters which will run for some considerable length of time. 
The former requirement will probably mean using a modest number of nodes on 
ALiCE for each run, it being a shared multi-purpose facility, and the latter 
requirement will favour running at a reasonable efficiency. Fortunately these 
considerations are entirely compatible and we have shown here that the required 
performance is available. We also now know that should a run need to proceed 
at greater speed, albeit less efficiently, then this is also feasible. 

The target surf ace/ volume ratio will in general change when a different sort 
of fermion matrix is used. Our measurements indicate that a fermion matrix 
with next-to-nearest-neighbour communications might introduce an additional 
communications overhead of about 10%. 
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5.3 Outlook 



It is reasonable to speculate about what these results mean, or how they would 
change, for future cluster computer platforms. A faster CPU will increase the 
proportion of time spent in communications relative to computation, but it 
should be remembered that a faster CPU will speed up the communications 
too, so the overall effect might not be too different. 

One should be aware that network and communications technology is also de- 
veloping: we can look forward to high-bandwidth PCI buses and faster networks 
[221 I2nj- In fact much of this technology exists already, and as cluster com- 
puting becomes more widespread and new network standards {e.g. InfiniBand) 
become available, current higher-performance technologies {e.g. Gigabit System 
Network) will become affordable. 

A greater network speed helps scalability, but also of importance for the devel- 
oper, in that it could change the way application code is designed, are imple- 
mentations of communications APIs that make good use of DMA technology to 
provide support for simultaneous sends/receives of messages to multiple nodes, 
or for overlapping communications with computations. 

Of great importance to performance of scientific codes are factors like data cache 
sizes and memory bandwidths. If these are greater then it is more difficult 
to achieve efficient scaling, but the benefits to the performance on each node 
might make decomposition still worthwhile. Conceivably, there could then be 
a situation where a finer grained decomposition actually improves speedup by 
moving into a regime where all local data fit into cache. 
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